DataFrame and Data Cleaning

In Python, a DataFrame is a 2-dimensional, labeled data structure provided by the pandas library. It is one of the most commonly used structures for data analysis and data science.

A DataFrame:

Stores data in rows and columns
Each column can have a different data type
Has row labels (index) and column names
Is mutable (can be changed)
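
To make these properties concrete, here is a minimal sketch. The sample values, the `Score` column, and the row labels `s1` and `s2` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the properties above.
df = pd.DataFrame(
    {"Name": ["Rocky", "John"], "Age": [21, 22], "Score": [85.5, 90.0]},
    index=["s1", "s2"],  # custom row labels (the index)
)

print(df.dtypes)             # each column has its own data type
print(df.index.tolist())     # row labels
print(df.columns.tolist())   # column names

# Mutability: overwrite a single cell in place.
df.loc["s1", "Age"] = 25
print(df.loc["s1", "Age"])
```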

import pandas as pd

# Create a DataFrame from a dictionary: each key becomes a column name
data = {
    "Name": ["Rocky", "John", "Anna"],
    "Age": [21, 22, 23],
    "Marks": [85, 90, 88]
}

df = pd.DataFrame(data)
print(df)

# Create a DataFrame from a list of rows, naming the columns explicitly
data = [
    ["Sheetal", 21, 85],
    ["Shyam", 22, 90],
    ["Meena", 23, 88]
]

df = pd.DataFrame(data, columns=["Name", "Age", "Marks"])
print(df)

# Access a column
print(df["Name"])

# Access a specific row and column by label
print(df.loc[0, "Marks"])

print(df.head())     # first 5 rows
print(df.tail())     # last 5 rows
print(df.shape)      # (rows, columns)
print(df.columns)    # column names
df.info()            # summary (prints directly)

We can add a new column to an existing DataFrame:

df["Grade"] = ["A", "A+", "A"]

Modify Data

df["Marks"] = df["Marks"] + 5

Delete a Column

df.drop("Age", axis=1, inplace=True)

Filtering Data

high_marks = df[df["Marks"] > 85]
print(high_marks)
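
Filters can also be combined. The sketch below rebuilds the sample table so that it runs on its own; the age threshold of 23 is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Sheetal", "Shyam", "Meena"],
    "Age": [21, 22, 23],
    "Marks": [85, 90, 88],
})

# Combine conditions with & (and) or | (or).
# Each condition needs its own parentheses.
selected = df[(df["Marks"] > 85) & (df["Age"] < 23)]
print(selected)

# .loc filters rows and picks a column in one step.
names = df.loc[df["Marks"] > 85, "Name"]
print(names.tolist())
```

Python's `and` and `or` keywords do not work element-wise on a column, which is why the `&` and `|` operators are used here.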

Why Use a pandas DataFrame?

Easy data manipulation
Powerful filtering and grouping
Used in Machine Learning, AI, Data Analytics
Works well with CSV, Excel, SQL databases

    df = pd.DataFrame({
        "Product": ["Laptop", "Mobile", "Tablet"],
        "Price": [80000, 30000, 20000]
    })

    print(df[df["Price"] > 25000])
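
Grouping and CSV support, both mentioned in the list above, can be sketched in the same way. The `Category` column and the extra product row are assumptions added for illustration, and an in-memory buffer stands in for a real CSV file.

```python
import io
import pandas as pd

# Hypothetical product table; "Category" is not part of the original example.
df = pd.DataFrame({
    "Category": ["Computer", "Phone", "Computer", "Phone"],
    "Product": ["Laptop", "Mobile", "Tablet", "Feature phone"],
    "Price": [80000, 30000, 20000, 5000],
})

# Grouping: average price per category.
avg_price = df.groupby("Category")["Price"].mean()
print(avg_price)

# CSV round trip: write to text, then read it back.
csv_text = df.to_csv(index=False)
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```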

Data Cleaning

Data Cleaning in a DataFrame means detecting and correcting errors, missing values, duplicates, or inconsistent data so that the dataset becomes accurate and ready for analysis.

Check Missing Values

    import pandas as pd

    df = pd.read_csv("data.csv")

    print(df.isnull())
    print(df.isnull().sum())
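
Since `data.csv` is not included here, the following sketch reads a small hypothetical CSV from a string instead. It also shows how to list only the rows that contain a missing value.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for data.csv.
csv_text = """Name,Age,Marks
Ram,20,80
Hari,,90
,21,85
"""

df = pd.read_csv(io.StringIO(csv_text))

# Count of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Rows that have at least one missing value.
incomplete_rows = df[df.isnull().any(axis=1)]
print(incomplete_rows)
```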

Fill Missing Values

df["Age"] = df["Age"].fillna(df["Age"].mean())
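
The line above fills missing ages with the column mean rather than deleting anything. Dropping the affected rows is the other common option. A minimal sketch comparing the two, using made-up sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ram", "Hari", "Sita"],
    "Age": [20.0, None, 22.0],
})

# Option 1: remove rows where Age is missing.
dropped = df.dropna(subset=["Age"])

# Option 2: keep every row and fill the gap with the mean of the known ages.
filled = df.assign(Age=df["Age"].fillna(df["Age"].mean()))

print(len(dropped))
print(filled["Age"].tolist())
```

Dropping loses the whole row, including its valid fields, while filling keeps the row but introduces an estimated value. Which is appropriate depends on the analysis.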

Check Duplicates

df.duplicated()

Remove Duplicates

df = df.drop_duplicates()
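
By default a row counts as a duplicate only when every column matches an earlier row. The sketch below, on made-up data, contrasts that with deduplicating on a single column through the `subset` parameter.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ram", "Hari", "Hari", "Hari"],
    "Marks": [80, 90, 90, 75],
})

# True marks a row that repeats an earlier row in all columns.
flags = df.duplicated()
print(flags.tolist())

# Drop exact duplicates only.
full = df.drop_duplicates()

# Treat rows with the same Name as duplicates and keep the first one.
by_name = df.drop_duplicates(subset=["Name"], keep="first")

print(len(full), len(by_name))
```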

Check Data Types

print(df.dtypes)
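
Checking types usually reveals numeric columns that were read as text. A possible way to fix them is sketched below; the sample values, including the `"unknown"` entry, are assumptions for illustration.

```python
import pandas as pd

# Numbers stored as text, as often happens after loading messy files.
df = pd.DataFrame({
    "Age": ["21", "22", "unknown"],
    "Marks": ["85", "90", "88"],
})

# astype works when every value can be converted.
df["Marks"] = df["Marks"].astype(int)

# to_numeric with errors="coerce" turns unparseable values into NaN
# instead of raising an error.
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")

print(df.dtypes)
print(df)
```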

Renaming Columns

df.rename(columns={"old_name": "new_name"}, inplace=True)

Removing Unnecessary Columns

df.drop("Address", axis=1, inplace=True)
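
Renaming and dropping can also be chained without `inplace=True`, which returns a new DataFrame and leaves the original untouched. A minimal sketch, where the column name `nm` and the sample values are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "nm": ["Ram", "Hari"],
    "Address": ["Kathmandu", "Pokhara"],
    "Marks": [80, 90],
})

# rename and drop both return new DataFrames, so they can be chained.
cleaned = df.rename(columns={"nm": "Name"}).drop(columns=["Address"])

print(cleaned.columns.tolist())
print(df.columns.tolist())   # the original is unchanged
```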

Filtering Outliers

df = df[df["Age"] < 100]
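
The filter above removes ages of 100 or more but still lets negative ages through. A sketch that bounds the value on both sides with `between` follows; the limits 0 and 100 and the sample rows are assumptions, not rules from any dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ram", "Hari", "Sita", "Gita"],
    "Age": [20, 150, -3, 35],
})

# between is inclusive on both ends by default.
valid = df[df["Age"].between(0, 100)]
print(valid)
```

Note that comparison filters such as `df["Age"] < 100` or `between` also drop rows where the value is missing, so it is usually safer to handle missing values first.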

Standardizing Text Data

df["Name"] = df["Name"].str.lower()
df["Name"] = df["Name"].str.strip()
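
To see why this matters, the sketch below applies both steps to made-up names that differ only in case and surrounding spaces.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["  Ram ", "HARI", "hari  ", None]})

# The .str methods can be chained; missing values stay missing.
df["Name"] = df["Name"].str.strip().str.lower()

print(df["Name"].tolist())
print(df["Name"].nunique())   # distinct names, ignoring missing values
```

Without standardizing, "HARI" and "hari  " would be counted as two different names and would not be caught by `drop_duplicates()`.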

Simple Example of Data Cleaning

    import pandas as pd

    data = {
        "Name": ["Ram", "Hari", "Hari", None],
        "Age": [20, None, 22, 21],
        "Marks": [80, 90, 90, 85]
    }

    df = pd.DataFrame(data)

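    # Fill missing ages with the column mean (assign back to the column;
    # calling fillna with inplace=True on a selected column is unreliable)
    df["Age"] = df["Age"].fillna(df["Age"].mean())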

    # Remove rows that exactly repeat an earlier row
    df.drop_duplicates(inplace=True)

    # Remove rows with missing names
    df.dropna(subset=["Name"], inplace=True)

    print(df)
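
In the example above, the two "Hari" rows end up with different ages once the missing value is filled, so `drop_duplicates()` keeps both. One possible variation wraps the steps in a function and deduplicates on selected columns instead. The function name `clean_students` and the choice of `Name` and `Marks` as the duplicate key are assumptions for illustration.

```python
import pandas as pd


def clean_students(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the input DataFrame is left unchanged."""
    out = df.copy()
    # Fill missing ages with the mean of the known ages.
    out["Age"] = out["Age"].fillna(out["Age"].mean())
    # Remove rows with no name.
    out = out.dropna(subset=["Name"])
    # Assumption: the same Name and Marks means the same student.
    out = out.drop_duplicates(subset=["Name", "Marks"])
    return out.reset_index(drop=True)


raw = pd.DataFrame({
    "Name": ["Ram", "Hari", "Hari", None],
    "Age": [20, None, 22, 21],
    "Marks": [80, 90, 90, 85],
})

cleaned = clean_students(raw)
print(cleaned)
```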
